High-Resolution Image Synthesis with Latent Diffusion Models
https://gyazo.com/2d012ca999f667b2ba0210021d740338
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at this https URL .
The diffusion model decomposes the image formation process into the sequential application of denoising autoencoders.
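Concretely, "sequential application of denoising autoencoders" means sampling starts from pure noise and a learned network removes a little of that noise at every timestep. The sketch below is a generic DDPM-style loop, not the paper's code: the tiny `denoiser` MLP, the linear beta schedule, and the 16-dimensional toy data are placeholder assumptions chosen only to keep the example runnable.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # toy linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative product of alphas

# Toy "denoising autoencoder": predicts the noise present in x at step t.
denoiser = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))

def q_sample(x0, t, eps):
    """Forward process: corrupt clean data x0 with Gaussian noise at timestep t."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

@torch.no_grad()
def p_sample_loop(shape):
    """Reverse process: start from noise and apply the denoiser T times in sequence."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        t_emb = torch.full((shape[0], 1), t / T)                 # crude timestep conditioning
        eps_hat = denoiser(torch.cat([x, t_emb], dim=-1))        # predicted noise
        ab, a, b = alpha_bars[t], alphas[t], betas[t]
        mean = (x - b / (1 - ab).sqrt() * eps_hat) / a.sqrt()    # DDPM posterior mean
        x = mean + b.sqrt() * torch.randn_like(x) if t > 0 else mean
    return x

samples = p_sample_loop((4, 16))   # four 16-dimensional toy "images"
print(samples.shape)               # torch.Size([4, 16])
```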
Guiding mechanisms can be implemented to control the image generation process without retraining
This is about controlling the generated image with text prompts.
Doing this in pixel space is expensive
So we do it in the latent space of a powerful pretrained autoencoder instead
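A minimal sketch of that idea, under assumed toy dimensions: the small convolutional `Encoder`/`Decoder` below stand in for the paper's pretrained KL/VQ autoencoder, and a single conv layer stands in for its U-Net noise predictor. The point is only that the expensive iterative denoising runs on a 4x32x32 latent rather than on 3x256x256 pixels, and the decoder maps the result back to an image.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):                         # 3x256x256 image -> 4x32x32 latent (8x downsampling)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 4, 4, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):                         # 4x32x32 latent -> 3x256x256 image
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

encoder, decoder = Encoder(), Decoder()
denoise_latent = nn.Conv2d(4, 4, 3, padding=1)    # placeholder for the U-Net noise predictor

with torch.no_grad():
    img = torch.randn(1, 3, 256, 256)             # dummy image batch
    z = encoder(img)                              # diffusion now runs on 1x4x32x32, not 1x3x256x256
    for _ in range(50):                           # (much cheaper) sequential denoising in latent space
        z = z - 0.1 * denoise_latent(z)           # stand-in for one reverse-diffusion update
    recon = decoder(z)                            # map the denoised latent back to pixel space
print(z.shape, recon.shape)                       # torch.Size([1, 4, 32, 32]) torch.Size([1, 3, 256, 256])
```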
https://gyazo.com/97dccbab020e9d43e0e0a152051d5325
Introducing cross-attention layers into the model architecture turns the diffusion model into a powerful and flexible generator for general conditioning inputs such as text or bounding boxes, and enables high-resolution synthesis in a convolutional fashion.
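A minimal sketch of such a cross-attention block, with assumed dimensions and a residual connection for illustration (the real model interleaves these blocks with the U-Net's convolutional layers and feeds in token embeddings from a conditioning encoder such as a text transformer): each spatial position of the feature map forms a query that attends over the conditioning tokens.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, channels=64, cond_dim=128, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(
            embed_dim=channels, num_heads=heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True)

    def forward(self, feat, cond):
        # feat: (B, C, H, W) U-Net feature map; cond: (B, L, cond_dim) conditioning token embeddings
        b, c, h, w = feat.shape
        q = self.norm(feat.flatten(2).transpose(1, 2))            # (B, H*W, C): one query per spatial position
        out, _ = self.attn(q, cond, cond)                         # attend to the conditioning tokens
        return feat + out.transpose(1, 2).reshape(b, c, h, w)     # residual connection back into the feature map

block = CrossAttentionBlock()
feat = torch.randn(2, 64, 32, 32)        # intermediate latent feature map
text_emb = torch.randn(2, 77, 128)       # e.g. 77 text-token embeddings per prompt (dimensions assumed)
out = block(feat, text_emb)
print(out.shape)                         # torch.Size([2, 64, 32, 32])
```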